AITopics

Technology: Information Technology > Artificial Intelligence (0.76)

Neural Information Processing Systems, Feb-16-2026, 22:48:20 GMT

9b9cfd5428153ccfbd4ba34b7e007305-Paper-Conference.pdf

With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness: the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs).

large language model, machine learning, natural language, (18 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
North America > Dominican Republic (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Kong, Liangji, Joshi, Aditya, Karimi, Sarvnaz

CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering

arXiv.org Artificial Intelligence, Dec-3-2025

Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured data (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts (farmer advisors) to obtain credible preliminary answers from complex evidence sources on the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answer pairs, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.

large language model, natural language, question answering, (16 more...)

2512.02251

Country: Oceania > Australia > New South Wales > Sydney (0.16)

Genre: Research Report > New Finding (0.46)

Industry: Food & Agriculture > Agriculture (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)

Neural Information Processing Systems, Oct-10-2025, 11:08:11 GMT

9b9cfd5428153ccfbd4ba34b7e007305-Paper-Conference.pdf

evaluation, metric, seg, (14 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
North America > Dominican Republic (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
(2 more...)

Neural Information Processing Systems, May-27-2025, 10:23:46 GMT

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)

prompt faithfulness, scoring text-to-image prompt coherence metric, t2iscorescore, (5 more...)

Technology: Information Technology > Artificial Intelligence (0.80)

arXiv.org Artificial Intelligence, Mar-19-2025

Automated Processing of eXplainable Artificial Intelligence Outputs in Deep Learning Models for Fault Diagnostics of Large Infrastructures

Floreale, Giovanni, Baraldi, Piero, Zio, Enrico, Fink, Olga

Deep Learning (DL) models processing images to recognize the health state of large infrastructure components can exhibit biases and rely on non-causal shortcuts. eXplainable Artificial Intelligence (XAI) can address these issues, but manually analyzing explanations generated by XAI techniques is time-consuming and prone to errors. This work proposes a novel framework that combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations that deviate from those of correctly classified images and may therefore indicate abnormal model behavior. This significantly reduces the workload for maintenance decision-makers, who only need to manually reclassify images flagged as having anomalous explanations. The proposed framework is applied to drone-collected images of insulator shells for power grid infrastructure monitoring, considering two different Convolutional Neural Networks (CNNs), GradCAM explanations and Deep Semi-Supervised Anomaly Detection. The average classification accuracy on two faulty classes is improved by 8% and maintenance operators are required to manually reclassify only 15% of the images. We compare the proposed framework with a state-of-the-art approach based on the faithfulness metric: the experimental results obtained demonstrate that the proposed framework consistently achieves F_1 scores larger than those of the faithfulness-based approach. Additionally, the proposed framework successfully identifies correct classifications that result from non-causal shortcuts, such as the presence of ID tags printed on insulator shells.

artificial intelligence, explanation, machine learning, (19 more...)

2503.15415

Country:

Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > Italy > Lombardy > Milan (0.04)
Europe > France (0.04)

Genre:

Overview (0.66)
Research Report > Promising Solution (0.34)

Industry: Energy > Power Industry (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Zaman, Kerem, Srivastava, Shashank

A Causal Lens for Evaluating Faithfulness Metrics

arXiv.org Artificial Intelligence, Feb-26-2025

Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model's internal reasoning faithfully, which is crucial for understanding the model's true decision-making processes. Although several faithfulness metrics have been proposed, a unified evaluation framework remains absent. To address this gap, we present Causal Diagnosticity, a framework to evaluate faithfulness metrics for natural language explanations. Our framework employs the concept of causal diagnosticity, and uses model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate a variety of faithfulness metrics, including post-hoc explanation and chain-of-thought-based methods. We find that all tested faithfulness metrics often fail to surpass a random baseline. Our work underscores the need for improved metrics and more reliable interpretability methods in LLMs.

computational linguistic, explanation, faithfulness metric, (9 more...)

2502.18848

Country:

North America > United States (0.46)
Europe > France (0.05)
Europe > Germany (0.04)
(13 more...)

Genre: Research Report > New Finding (0.68)

Industry: Leisure & Entertainment (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence, Oct-30-2024

On Positional Bias of Faithfulness for Long-form Summarization

Wan, David, Vig, Jesse, Bansal, Mohit, Joty, Shafiq

Large Language Models (LLMs) often exhibit positional bias in long-context settings, under-attending to information in the middle of inputs. We investigate the presence of this bias in long-form summarization, its impact on faithfulness, and various techniques to mitigate this bias. To consistently evaluate faithfulness, we first compile a benchmark of eight human-annotated long-form summarization datasets and perform a meta-evaluation of faithfulness metrics. We show that LLM-based faithfulness metrics, though effective with full-context inputs, remain sensitive to document order, indicating positional bias. Analyzing LLM-generated summaries across six datasets, we find a "U-shaped" trend in faithfulness, where LLMs faithfully summarize the beginning and end of documents but neglect middle content. Perturbing document order similarly reveals that models are less faithful when important documents are placed in the middle of the input. We find that this behavior is partly due to shifting focus with context length: as context increases, summaries become less faithful, but beyond a certain length, faithfulness improves as the model focuses on the end. Finally, we experiment with different generation techniques to reduce positional bias and find that prompting techniques effectively direct model attention to specific positions, whereas more sophisticated approaches offer limited improvements. Our data and code are available at https://github.com/meetdavidwan/longformfact.

computational linguistic, faithfulness, summarization, (15 more...)

2410.23609

Country:

North America > Mexico > Mexico City > Mexico City (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(15 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Zhang, Weijia, Aliannejadi, Mohammad, Pei, Jiahuan, Yuan, Yifei, Huang, Jia-Hong, Kanoulas, Evangelos

A Comparative Analysis of Faithfulness Metrics and Humans in Citation Evaluation

arXiv.org Artificial Intelligence, Aug-22-2024

Large language models (LLMs) often generate unsupported or unverifiable content, known as "hallucinations." To address this, retrieval-augmented LLMs are employed to include citations in their content, grounding the content in verifiable sources. Despite such developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies tackle this challenge by leveraging faithfulness metrics to estimate citation support automatically. However, they limit this citation support estimation to a binary classification scenario, neglecting fine-grained citation support in practical scenarios. To investigate the effectiveness of faithfulness metrics in fine-grained scenarios, we propose a comparative evaluation framework that assesses how effectively each metric distinguishes citations among three support levels: full, partial, and no support. Our framework employs correlation analysis, classification evaluation, and retrieval evaluation to measure the alignment between metric scores and human judgments comprehensively. Our results indicate that no single metric consistently excels across all evaluations, highlighting the complexity of accurately evaluating fine-grained support levels. In particular, we find that the best-performing metrics struggle to distinguish partial support from full or no support. Based on these findings, we provide practical recommendations for developing more effective metrics.

evaluation, faithfulness metric, scenario, (13 more...)

2408.12398

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Netherlands > North Holland > Amsterdam (0.05)
North America > United States > District of Columbia > Washington (0.05)
(7 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Zhang, Weijia, Aliannejadi, Mohammad, Yuan, Yifei, Pei, Jiahuan, Huang, Jia-Hong, Kanoulas, Evangelos

Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics

arXiv.org Artificial Intelligence, Jun-21-2024

Large language models (LLMs) often produce unsupported or unverifiable information, known as "hallucinations." To mitigate this, retrieval-augmented LLMs incorporate citations, grounding the content in verifiable sources. Despite such developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies use faithfulness metrics to estimate citation support automatically but are limited to binary classification, overlooking fine-grained citation support in practical scenarios. To investigate the effectiveness of faithfulness metrics in fine-grained scenarios, we propose a comparative evaluation framework that assesses how effectively each metric distinguishes citations among three support levels: full, partial, and no support. Our framework employs correlation analysis, classification evaluation, and retrieval evaluation to measure the alignment between metric scores and human judgments comprehensively. Our results show that no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support. Based on the findings, we provide practical recommendations for developing more effective metrics.

computational linguistic, evaluation, faithfulness metric, (14 more...)

2406.15264

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)
(10 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)